1 About the tutorials

The following tutorials teach how to run clustalw2 and clustal-omega from the command line, along with a few useful AWK, Bash and Perl idioms and scripts.

1.1 ASSUMPTIONS

This tutorial expects that the user has access to a Unix/Linux environment with the standard Bash shell installed, as well as the clustalw2 and clustal-omega multiple sequence alignment programs installed and found in the $PATH. Both programs are freely available for download from the clustal.org

1.2 AUTHOR

These tutorials were designed and written by Pablo Vinuesa, CCG-UNAM, Cuernavaca, Mexico (X - @pvinmex), for the International Workshops on Bioinformatics, TIB2024, his courses taught at the Bachelor’s Program in Genome Science, LCG-UNAM, and the Workshop on Genome Sciences, Facultad de Ciencias UNAM.

1.2.1 SOURCE REPOSITORY

This material is distributed from the following TIB-filoinfo GitHub repository.

1.2.2 LICENSE

Creative Commons Licence
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0


2 Multiple sequence alignments and the CLUSTAL family of programs

Multiple sequence alignments (MSAs) are essential in most bioinformatics analyses that involve comparing homologous sequences. The exact way of computing an optimal alignment between N sequences has a computational complexity of \(O(L^N)\) for \(N\) sequences of length \(L\) making it prohibitive for even small numbers of sequences using exhaustive algorithms (Sievers et al. 2011). Therefore, all MSA algorithms implement some sort of heuristics (see Katho K. ed. 2021), such as the progressive multiple sequence alignment algorithm implemented in the Clustalw and preceding Clustal versions.

Clustal programs are used for multiple sequence alignment of DNA and protein sequences. The software and its algorithms have gone through several iterations, with \(Clustal{\Omega}\) (Omega) being the latest version, published in of 2011. It is available as standalone software, via a web interface, and through a server hosted by the European Bioinformatics Institute.

Clustal has been an important bioinformatic software, with two of its academic publications among the top 100 papers cited of all time, according to Nature in 2014.

Here we will demonstrate the use of the current command-line versions: clustalw2/clustalX and clustal-omega to perform:

  1. Standard multiple sequence alignments of protein sequences
  2. Profile to profile and sequence to profile alignments
  3. Build CDS alignments guided by the alignment of the the corresponding translation products (proteins)

Both \(Clustalw2\) and \(Clustal{\Omega}\) can read multiple sequence formats, including CLUSTAL and FASTA, and write the computed alignments in even more formats, including STOCKHOLM, NEXUS, and PHYLIP.

clustal.org
clustal.org

3 PRELIMINARIES - setting up working directories

The following code will set up our working directories for the tutorials and fetch the required sequences and scripts.

# i) Change directory into your working directory and make the following new directory to hold the sequence data for the exercises
mkdir sesion4_alineamientos && cd sesion4_alineamientos


# ii) use wget on the command line to fetch the required files from my TIB-filoinfo GitHub repo
wget -c https://github.com/vinuesa/TIB-filoinfo/raw/master/docs/sesion4_alineamientos/sequences_for_alingment.tgz

if [ -s sequences_for_alingment.tgz ] 
then
     if file sequences_for_alingment.tgz | grep gzip &> /dev/null
     then 
         echo "unpacking sequences_for_aligment.tgz ..."
         tar -xzf sequences_for_alingment.tgz 
     else 
         echo "ERROR: This is not a gzip file"
     fi
else
    echo "ERROR: sequences_for_aligment.tgz not found"
fi

# iii) Download the required scripts to your ~/bin directory
base_url=https://raw.githubusercontent.com/vinuesa/TIB-filoinfo/refs/heads/master

if [ -d ~/bin ]; then
    cd ~/bin
else
   mkdir ~/bin && cd ~/bin
fi

# This is a Bash array holding the list of scripts to be downloaded
scripts=(split_fasta.pl translate_fastas.pl prot2cdnAln.pl convert_alnFormats_using_clustalw.sh convert_aln_format_batch_bp.pl fasta_toolkit.awk)

# Iterate over the array elements, fetching each script with 'wget -c' and making them executable with 'chmod 755'
for f in "${scripts[@]}"
do
    wget -c "$base_url"/"$f"   
    [ -s "$f" ] && chmod 755 "$f"
done 

# Change directory back to the previous working directory
cd -

# list the directory contents
pwd && ls

We are set to start the hands-on tutorials on running clustal from the command line.


4 Clustalw background and exercises

Clustalw implements a fast version of the progressive multiple sequence alignment algorithm, featuring a fast and simple method for making “guide trees.” These are clusterings of the sequences that are used to decide the order of alignment during the later progressive alignment phase. The general idea is to start with alignments of just two sequences, usually the closest ones in the dataset. The alignment is then built up by aligning alignments with each other or sequences to alignments, according to the topology of the guide tree. The complexity of the guide tree construction is usually \(O(N^2)\) for \(N\) sequences because all \(N\) sequences have to be compared to each other. Earlier versions of Clustal used fast word based alignments for these comparisons which made it memory efficient and fast enough to work on PCs and Macintosh computers with up to a few hundred sequences (Sievers & Higgins, 2018).

Clustalw can be freely obtained from the Clustalw2 webpage. Clustal 2 comes in two flavors: the command-line version Clustal W and the graphical version Clustal X. Precompiled executables for Linux, Mac OS X and Windows (incl. XP and Vista) of the most recent version (currently 2.1).

Clustalw2 webpage
Clustalw2 webpage

4.1 Basic clustalw commands

Here we are going to use only the command-line version clustalw.

Note that the \(clustalw\) and \(clustalw2\) calls are equivalent and execute the same clustalw version 2.1 binay.

Below is a compilation of some of the most frequently used \(clustalw\) options:

#======================================================================================
# Basic clustalw usage
#--------------------------------------------------------------------------------------

# explore the manuals
man clustalw

clustalw -help
clustalw -fullhelp

# standard clustal command line call
clustalw -infile=fasta_no_alineado -align -output=fasta -outfile=archivo_cluwaln.fasta

# clustalw call with modified parameters
clustalw -infile=archivo.fasta -align -pwmatrix=blosum -pwgapopen=12
-pwgapext=0.2 -matrix=blosum -gapopen=12 -gapext=0.2 -outorder=aligned 
-convert -outfile=archivo_aln1.phy -output=phylip

4.2 Standard clustalw alignment of GDP (Glycerol-3-phosphate dehydrogenase [NAD(+)], cytoplasmic) protein sequences from prokaryots and eukaryots

The next lines show how to align the the provided FASTA files with GDP sequences from prokaryots and eukaryots, writing the resulting multiple sequence alignments (MSAs) in FASTA format.

# default clustalw alignments
clustalw -infile=4_GDP_procar_ualn.faa -outfile=4_GDP_procar_cluAln.faa -output=fasta
clustalw -infile=6_GDP_eucar_ualn.faa -outfile=6_GDP_eucar_cluAln.faa -output=fasta

4.3 Profile-profile alignment with clustalw, writing output in clustal’s native clustal format

# A profile to profile alignment
clustalw -profile -profile1=6_GDP_eucar_cluAln.faa -profile2=4_GDP_procar_cluAln.faa -outfile=10_GDP_eucar_prokar.aln

4.3.1 The clustal format

The clustal format can be conveniently visualized directly in the terminal, as shown below:

# display the 10_GDP_eucar_prokar.aln MSA written in clustal format
cat 10_GDP_eucar_prokar.aln
CLUSTAL 2.1 multiple sequence alignment


gpdhuman        ---------------MASKK----------------VCIVGSGNWGSAIAKIVGGNAAQL
gpdarabit       ----------------AGKK----------------VCIVGSGDWGSAIAKIVGGNAAQL
gpdamouse       ----------------AGKK----------------VCIVGSGNWGSAIAKIVGSNAGRL
gpdadrome       ---------------MADKVN---------------VCIVGSGNWGSAIAKIVGANAAAL
gpdacaeel       ---------------MSPKK----------------VTIIGSGNWGSAIARIVGSTTKSF
gpd1yeast       MSAAADRLNLTSGHLNAGRKRSSSSVSLKAAEKPFKVTVIGSGNWGTTIAKVVAENCKGY
gpdaecoli       -------------MNQRNAS----------------MTVIGAGSYGTALAITLARNGHEV
gpdahaein       -------------MITSQTP----------------ITVLGAGSYGTALAITFSRNGSPT
gpdpseu         -----------------------------------------MTSCCGAVAKTRWRN----
gpdabacsu       -----------------MKK----------------VTMLGAGSWGTALALVLTDNGNEV
                                                           .   ::*     .    

gpdhuman        A-QFDPRVTMWVFEEDIGGKKLTEIINTQHENVKYLPGHKLPPNVVAVPDVVQAAEDAD-
gpdarabit       A-QFDPRVTMWVFEEDIGGKKLTEIINTHQENVKYLPGHKLPPNVVAVPDVVKAAADAD-
gpdamouse       A-HFDPRVTMWVFEEDIGGRKLTEIINTQHENVKYLPGHKLPPNVVAIPDVVQAATGAD-
gpdadrome       P-EFEERVTMFVYEELIDGKKLTEIINETHENVKYLKGHKLPPN-VAVPDLVEAAKNAD-
gpdacaeel       PDEFDPTVRMWVFEEIVNGEKLSEVINNRHENIKYLPGKVLPNNVVAVTDLVESCEGSN-
gpd1yeast       PEVFAPIVQMWVFEEEINGEKLTEIINTRHQNVKYLPGITLPDNLVANPDLIDSVKDVD-
gpdaecoli       V--------LWGHDPEH-----IATLERDRCNAAFLPDVPFPDTLHLESDLATALAASR-
gpdahaein       H--------LWGHNPAH-----IAQMQTERQNYRFLPDVIFPEDLHLESNLAQAMEYSQ-
gpdpseu         -------------------------VPDPCENTAYLPGHPLPAALKATADFSLALDHVAQ
gpdabacsu       C--------VWAHRADL-----IHQINELHENKDYLPNVKLSTSIKGTTDMKEAVSDAD-
                                         :     *  :* .  :.      .:.  :      
... TRUNCATED

Note that clustal uses the following symbols below each alignment block to highlight conserved residues:
*   (asterisk)  positions that have a single and fully conserved residue
:   (colon)     conserved: conservation between groups of strongly similar properties (score > 0.5 on the PAM 250 matrix)
.   (period)    semi-conserved: conservation between groups of weakly similar properties (score ≤ 0.5 on the PAM 250 matrix)
    (blank)     non-conserved

Only the N-terminal part of the MSA is shown. Notice the long N-terminal extension of the yeast sequence. What do you think about the MSA around this extension?

4.3.2 Using seaview to visualize and edit MSAs

Seaview is a multiplatform, graphical user interface for multiple sequence alignment and molecular phylogeny.

  • SeaView reads and writes various file formats (NEXUS, MSF, CLUSTAL, FASTA, PHYLIP, MASE, Newick) of DNA and protein sequences and of phylogenetic trees.
  • SeaView drives programs muscle or Clustal Omega for multiple sequence alignment, and also allows to use any external alignment algorithm able to read and write FASTA-formatted files.
  • Seaview drives the Gblocks program to select blocks of evolutionarily conserved sites.
  • SeaView computes phylogenetic trees by parsimony, using PHYLIP’s dnapars/protpars algorithm, distance, with NJ or BioNJ algorithms on a variety of evolutionary distances, maximum likelihood, driving program PhyML 3.1.
  • Seaview can use the Transfer Bootstrap Expectation method to compute the bootstrap support of PhyML and distance trees.
  • Seaview uses the Treerecs method to reconcile gene and species trees.
  • SeaView prints and draws phylogenetic trees on screen, SVG, PDF or PostScript files.
  • SeaView allows to download sequences from EMBL/GenBank/UniProt using the Internet.

Seaview can be run from the command line, and the options can be found with:

# display the help menu
seaview -h

or here: seaview command manual

To visualize an alignment, simply call \(seaview\) with the alignment file name as single argument:

seaview 10_GDP_eucar_prokar.aln &
Visualization of an MSA with seaview
Visualization of an MSA with seaview

4.4 Multiple sequence alignment of protein-coding sequences (CDSs)

We will first attempt to perform a default MSA of nine full-length leuA sequences from diverse Bacillales using \(clustalw\)

Lets firs count the sequences, and have a look at them

# count sequences
grep -c '^>' leuA-Bacillales.fas

# display file
cat leuA-Bacillales.fas
9
>B_anthracis_Ames 30261500
atgaagcaaattttgttcatggatacgacgctacgtgatggcgaacaatcaccaggagtaaatttaaatgaacaagagaa
attgcaaattgcaaggcaattagagagactagggattcatgtcatggaagctggctttgcagcagcttccgaaggggatt
ttcaatcggtaaagcgcatcgcaaatacaattcaaaatgctacagttatgagtttagcaagagcgaaagaaagtgatatt
cgaagagcatatgaagctgtgaaaggtgcggtatctcctcgtttacacgtatttttagctacgagtgatattcatatgaa
gtataaactttgtatgtcaaaagaagatgtattagatagtatttatcgatcggttacacttgggaaatcgttatttccga
cagtacaattttcagcagaagatgcaacaagaacagcaagagattttttagctgaagcggtagaagttgcgattcgtgca
ggagcgaatgtaattaatattccagatacagttgggtatacgaatccagaagaatattatgctctttttaaatatttaca
agaatccgttccttcatatgaaaaagcaattttctcttgtcattgtcatgatgatcttgggatggcggtagcaaattcat
tagctgcagttgaaggcggagcgttgcaagtagaaggaacaattaatgggattggagaaagagcagggaatgcggcgtta
gaagaggttgcagttgctcttcatattaggaaagatttctataaagcagagccttctatgacgctaaaagaaattaaagc
gacaagtacattagtaagcagattaacgggtatggttgtatcgaaaaataaagcgattgttggagcaaatgcattcgctc
atgaatcaggcattcatcaagatggtgtattaaaagaagtgacaacatatgaaattattgaaccagcgctagtaggcgaa
tctcaaaatctatttgtacttggaaaacattctgggcgtcacgcgtttacagaaaaaatgaaagaacttggctatgaatt
tacaaacgacgagcgggatgcagtatttgaagcatttaaaaaattagctgatcgtaagaaggaaattactgaggaagatt
tacgagcacttatgcttggtgaagcagcatttgcggctcagcaatataacattacgcaattgcaagtacatttcgtatca
aatagtacacaatgtgcgacagttgtactaaaagatgaggaaggaaatgtattcgaagatgcagctactggctctggtag
tattgaagcgatttataatgcaatccaaagaattttaggattggaatgtgaattggcggattatcgcatacaatctatta
cacaaggtcaagatgcactggctcatgttcatgttgaattaaaagaaggagcacatcaagtatcaggttttggtgttgca
caagacgtattggaagcgtcggcaagagcatatgtccacgcagcggggaaattgaaatcttttatccagcttgtgaaata
a
... truncated

4.4.1 Default alignment of CDSs (the incorrect way of doing it!)

Align them under default parameters:

# generate the MSA
clustalw -infile=leuA-Bacillales.fas -outfile=leuA-Bacillales.aln

# display the MSA
cat leuA-Bacillales.aln
... truncated
B_anthracis_Ames                TGTTGAATTAAAAGAA--GGA-GCACATCAAGTATCAGGTTTTGGTGTTG
B_thuringiensis_konkukian       TGTTGAATTAAAAGAA--GGA-GCACATCAAGTATCAGGTTTTGGTGTTG
B_cereus_ATCC14579              TGTTGAATTAAAAGAA--GGT-AGTCACCAAATATCAGGTTTCGGAGTTG
B_licheniformis_ATCC_14580      CGTCAGGGTG-ATGAT--CGATGGAAAAGAATCGGGAGGACGCGGGGTTG
B_subtilis                      TGTAAGAGTT-TTGGT--GAACGGAAAAGAATCAGCAGGGCGGGGCATAG
B_halodurans                    TGTTCAAATGGATTAT--CAA-GGAGTCGTCTCAAGTGGACGCGGCACGG
O_iheyensis                     TGTTCAAATT-ATTAT--TAATGGAGAAACAATCAATGGTAGAGGGTCTG
L_innocua                       TGTCGTCATTAAAGATGATAATGGCACTGAATTCCATGGTATCGGTATCG
L_monocytogenes_4b_F2365        TGTCGTCATTAAAAATGACAAAGGCGCTGTGTTCCATGGTATCGGTATTG
                                 **     *                            **    **    *

B_anthracis_Ames                CACAAGACGTATTGGAAGCGTCGGCAAGAGCATATGTCCACGCAGCGGGG
B_thuringiensis_konkukian       CACAAGACGTATTGGAAGCGTCGGCAAGAGCATATGTCCACGCAGCGGGG
B_cereus_ATCC14579              CGCAAGACGTATTGGAAGCGTCGGCGAGAGCTTATGTCCATGCAGCAGGG
B_licheniformis_ATCC_14580      CCCAGGATGTTCTCGAAGCATCTGCCAAAGCCTACTTGAATGCTGTGAAC
B_subtilis                      CGCAAGACGTATTAGAAGCATCAGCGAAAGCCTATTTGAACGCAGTAAAC
B_halodurans                    CCCATGATGTGTTAGAAGCGTCAGCAAAAGCCTATTTAAACGCAGTTAAT
O_iheyensis                     CTCAGGATGTGCTTCAAGCGTCTGCAGTAGCATTCATTCAGGCAGTAAAT
L_innocua                       ATTTTGATGTTTTAACAGCTAGTGCTAAAGCTTATTTGCAAGCATCTGGA
L_monocytogenes_4b_F2365        ATTTTGACGTTTTAACTGCTAGTGCTAAAGCTTATTTGCAAGCATCAGGC
                                     ** **  *    **    **   *** *   *  * **       

B_anthracis_Ames                AAATTGAAATCTTTTATCCAGCTTGTGAAATAA-----------------
B_thuringiensis_konkukian       AAATTGAAATCTTTTATCCAGCTTGTGAAATAA-----------------
B_cereus_ATCC14579              AAATTAAAATCTCTTATAGCGCTTGTGAAATAA-----------------
B_licheniformis_ATCC_14580      CGCTATCTCGTATTGAAGACGAATACGGAAGGATTGTCAAAACAGGCGGC
B_subtilis                      CGTCAATTGGTTTTCCAGTCGAATATGAGCGGATTGAAAAACCACACAGC
B_halodurans                    CG--AACGATTAATCGAAAGAAATATGCAGAAGTTTATGGATCGAA----
O_iheyensis                     CGTTATTACGT--TCAAAAGAAA-----GCAAATC--TAACTCGACTTGT
L_innocua                       AAAAGCAAATCTACAAGTAAGCA----AGCTGATTTCGAGGAGGTAAAGT
L_monocytogenes_4b_F2365        AAAAGCAAAACCGCCAGTAAGCA----AGCTGATTTCGAGGAGGTAAAAT
... truncated

What do you notice from the output shown above of the leuA-Bacillales.aln MSA?

4.4.2 The correct way of aligning CDSs

The correct mode of aligning CDS requires their translation and alignment at the protein level.

4.4.2.1 Tranlate CDSs to proteins with the aid of translate_fastas.pl

Here we assume that all CDSs in the FASTA file are in the same+1 frame.

translate_fastas.pl -e fas -t 11

Lets visualize the translated products:

cat leuA-Bacillales_translated.faa
>B_anthracis_Ames 30261500
MKQILFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRI
ANTIQNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLDS
IYRSVTLGKSLFPTVQFSAEDATRTARDFLAEAVEVAIRAGANVINIPDTVGYTNPEEYY
ALFKYLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAAL
EEVAVALHIRKDFYKAEPSMTLKEIKATSTLVSRLTGMVVSKNKAIVGANAFAHESGIHQ
DGVLKEVTTYEIIEPALVGESQNLFVLGKHSGRHAFTEKMKELGYEFTNDERDAVFEAFK
KLADRKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVFED
AATGSGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGAHQVSGFGVA
QDVLEASARAYVHAAGKLKSFIQLVK*
>B_cereus_ATCC14579 30019549
LFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRIANTI
QNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLNSIHRS
VTLGKSLFPTVQFSAEDATRTSRAFLAEAVEIAIRAGANVINIPDTVGYTNPEEYYSLFE
YLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAALEEVA
VALHIRKDFYKAEPSMTLKEIKATSILVSRLTGMVVPKNKAIVGANAFAHESGIHQDGVL
KEVTTYEIIEPALIGESQNLFVLGKHSGRHAFTEKMKELGYEFTNEERDAAFEAFKKLAD
RKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVYEDAATG
SGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGSHQISGFGVAQDVL
EASARAYVHAAGKLKSLIALVK*
... truncated

Note that the stop codons were translated as ’*’ symbols. These are not aminoacids and should be removed. This can be easily achieved with a simple \(sed\) one-liner:

# use sed to replace the terminal * symbols by nothing
sed 's/\*$//' Bacillales_translated.faa > k && mv k Bacillales_translated.faa
# visualize the edited translation products
cat leuA-Bacillales_translated.faa
>B_anthracis_Ames 30261500
MKQILFMDTTLRDGEQSPGVNLNEQEKLQIARQLERLGIHVMEAGFAAASEGDFQSVKRI
ANTIQNATVMSLARAKESDIRRAYEAVKGAVSPRLHVFLATSDIHMKYKLCMSKEDVLDS
IYRSVTLGKSLFPTVQFSAEDATRTARDFLAEAVEVAIRAGANVINIPDTVGYTNPEEYY
ALFKYLQESVPSYEKAIFSCHCHDDLGMAVANSLAAVEGGALQVEGTINGIGERAGNAAL
EEVAVALHIRKDFYKAEPSMTLKEIKATSTLVSRLTGMVVSKNKAIVGANAFAHESGIHQ
DGVLKEVTTYEIIEPALVGESQNLFVLGKHSGRHAFTEKMKELGYEFTNDERDAVFEAFK
KLADRKKEITEEDLRALMLGEAAFAAQQYNITQLQVHFVSNSTQCATVVLKDEEGNVFED
AATGSGSIEAIYNAIQRILGLECELADYRIQSITQGQDALAHVHVELKEGAHQVSGFGVA
QDVLEASARAYVHAAGKLKSFIQLVK
... truncated

4.4.2.2 Align the translated products keeping the original input order in the output MSA

Note that it is critical to conserve the original ordering of the sequences from the input file in the output MSA using the \(-OUTORDER=input\) option.

clustalw -help
clustalw -infile=leuA-Bacillales_translated.faa -outfile=leuA-Bacillales_translated_clu.faa -output=fasta -OUTORDER=input

4.4.2.3 Use the prot2cdnAlns.pl perl script to reconstruct the CDS MSA using the translation product MSA as guide alignment.

prot2cdnAlns.pl -h
prot2cdnAlns.pl leuA-Bacillales.fas leuA-Bacillales_translated_clu.faa

4.4.2.4 Compare the leuA-Bacillales_cdnaln.fas codon alignment with the default leuA-Bacillales.aln MSA

Let us display both the codon and default MSAs of the leuA CDSs:

seaview leuA-Bacillales_cdnaln.fas &

seaview leuA-Bacillales.aln &
Comparison of the codon and default MSAs of the leuA CDSs
Comparison of the codon and default MSAs of the leuA CDSs

The output shown above clearly demonstrates that when aligning CDSs, we should first translate them to proteins, perform the alignment of the translation products, and finally reconstitute the CDS alignment using the protein MSA as a guide.


5 MSA with clustal\({\Omega}\)

Clustal\({\Omega}\) is the last incarnation of the Clustal family of MSA programs. It offers a significant increase in scalability over previous versions thanks to the implementation of novel algorithms that avoid bottlenecks of the classic progressive multiple sequence alignment algorithms (Sievers et al. 2011).

5.1 What are the differences between Clustalw and Clustal-omega?

Clustalw and earlier Culstal versions use fast word based alignments for these comparisons which made it memory efficient and fast enough to work on PCs and Macintosh computers. However, as the number of sequences grows above a few thousand, the \(O(N^2)\) complexity associated with the computation of pairwise distance matrices to generate “guide trees” becomes very time consuming and makes very big alignments difficult to make.

  • \(Clustal\Omega\) implements an \(O(NlogN)\) method called \(mBed\) (Sievers et al. 2011) that allows guide trees of hundreds of thousands of sequences to be made by restricting the calculation of sequence alignment scores to NLog(N). This mBed method is what is used in Clustal Omega to give capacity and scalability for very large datasets.

  • The second main development in Clustal Omega was to use an alignment engine for aligning profile hidden Markov models (HMMs) to each other instead of the conventional dynamic programing and profile alignment. We used \(HHalign\) (Söding 2005) which had been shown to have very high accuracy for profile HMM alignment. This gives greatly increased accuracy to Clustal Omega when compared to earlier Clustal programs, as measured on structure based alignment benchmarks. Only a small amount of original code from the earlier Clustal programs was used for the new program: the fast word-based pairwise alignment routines. The rest of the code was coded new from scratch or taken from publically available libraries.

The paper describing Clustal\({\Omega}\) is the following one: Sievers et al. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11:7:539. doi: 10.1038/msb.2011.75

ABSTRACT Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.

The culstal omega webpage
The culstal omega webpage

5.2 Default alignment of multiple FASTA files with clustal-omega using a for loop

Here we will repeat the MSA of the twoFASTA files containing four and six bacterial and eukaryotic GDP protein sequences using clustalo called from a Bash for loop.

# call clustalo within a for loop to align all *ualn.faa files found in the working directory
for file in *ualn.faa; do clustalo -i $file -o ${file%.*}_cluOaln.phy --outfmt phy --output-order input-order; done

5.3 Profile to profile (p2p) alignment

clustalo --profile1 4_GDP_procar_ualn_cluOaln.phy --profile2 6_GDP_eucar_ualn_cluOaln.phy --infmt phy --outfmt clu -o 10_GDP_eucar_prokar_cluOp2p.aln 

5.4 Sequences to profile (s2p) alignment

clustalo --profile1 6_GDP_eucar_ualn_cluOaln.phy -i 4_GDP_procar_ualn.faa -o 10_GDP_eucar_prokar_cluOs2p.aln --outfmt clu

5.5 Comparison of the p2p and s2p 10_GDP_eucar_proka alignments

paste 10_GDP_eucar_prokar_cluOp2p.aln 10_GDP_eucar_prokar_cluOs2p.aln
pasted p2p vs. s2p GDP alignments
pasted p2p vs. s2p GDP alignments

Several differences can be seen between the p2p and s2p GDP alignments shown above. Which one is the best?

To gain evidence in favor of one or the other, we will perform a new \(clustalo\) alignment on the full 10 GDP sequence set using iterations and Kimura correction.

5.6 Alignment of the 10 GDP sequences with clustal-omega using iterations and Kimura correction

Iterations can be used in an attempt to improve the alignment by a process of repeated refinement (Sievers & Higgins, 2014). Iterative refinement can be accomplished in two ways: (i) guide-tree refinement and/or (ii) refinement using the multiple alignment output as an external profile HMM. This process is illustrated in the bottom part of Figure shown below (Sievers & Higgins, 2014).

The second iteration scheme uses a profile HMM, derived from an alignment of the input sequences, as an external profile. The first alignment gives an outline of the positions where residues are more likely to align in a second round of alignment.

Both iteration methods can be carried out a number of times, separately or in combination. The default is to combine them, as will be shown in the protocol below.

# 1. combine the source FASTA files into a single one named 10_GDP_procar_eukar_ualn.fas
cat 4_GDP_procar_ualn.faa 6_GDP_eucar_ualn.faa > 10_GDP_procar_eukar_ualn.fas

# 2. Perform clustalo MSA with two iterations and using Kimura-corrected distances
clustalo -i 10_GDP_procar_eukar_ualn.fas --iter 2 --use-kimura -o 10_GDP_eucar_prokar_cluoAln2iter.aln --outfmt clu

Clustal Omega will first construct a guide tree and align the sequences in the order specified by the guide tree. This alignment is then used in two ways:

  1. A new guide tree is constructed, based on distances derived from the aligned sequences
  2. The alignment is converted into a profile HMM

During the second round of alignment (the first iteration), sequences are aligned to the internally constructed profile HMM, pseudo-count information is transferred to the unaligned sequences, and the ‘enhanced’ sequences are aligned to each other in the order specified by the newly constructed guide tree. The final alignment is written in FASTA format to outfile.fa .

The first iteration will be followed by another guide-tree construction and HMM pseudo-count transfer.

While the alignment accuracy is expected to be higher for the iterated alignments, so is also the response time. Every round of iteration can take up to three times longer than the initial alignment. For example, if the initial alignment took 1 min, then an alignment that was iterated twice might take up to 7 min; that is, 1 min for the initial alignment and then two times 3 min for the iterations.

Unfortunately, experience has shown that the alignment accuracy cannot be improved indefinitely by iteration. Good improvements are often achieved with one or two iterations (Sievers & Higgins, 2014).

Let us compare the standard and iterated alignments of the 10_GDP_eucar_prokar sequences:

paste 10_GDP_eucar_prokar_cluoAln2iter.aln 10_GDP_eucar_prokar_cluOp2p.aln
p2p vs. iteration alignment with Kimura distances
p2p vs. iteration alignment with Kimura distances

As shown above, there are minimal differences in the p2p vs. the iteration alignment with Kimura distances performed on the full set of 10 GDP sequences. Can you spot them?

Based on the high similarity of both alignments, we might conclude that the p2p or full alignment with two iterations are better than the s2p alignment presented in a previous section.

5.7 CDS alignment of leuA-Bacillales.fas using clustalo

As discussed previously, CDSs should be aligned using their aligned translation products as a guide.

# 1. Translate the leuA-Bacillales.fas file with the aid of the translate_fastas.pl Perl script
translate_fastas.pl -e fas -t 11 

# 2. Remove the trailing *'s resulting from the translation of stop codons present in the DNA sequences
sed 's/\*$//' Bacillales_translated.faa > k && mv k Bacillales_translated.faa

# 3. Align the tranlation products keeping the input order
clustalo -i leuA-Bacillales_translated.faa -o leuA-Bacillales_translated_cluo.faa --output-order input-order

# 4 Use the prot2cdnAlns.pl Perl script to generate the codon alignment using the protein alignment as a guide
prot2cdnAlns.pl leuA-Bacillales.fas leuA-Bacillales_translated_cluo.faa

5.7.1 Visualize the codon alignment with seaview

Finally, we will visualize the resulting codon alignment of the leuA CDSs at the DNA and protein levels.

seaview  leuA-Bacillales_cdnaln.fas &
clustalo codon alignment of leuA CDSs
clustalo codon alignment of leuA CDSs
clustalo codon alignment of leuA CDSs viewed as proteins
clustalo codon alignment of leuA CDSs viewed as proteins

5.8 Multiple sequence alignment using an external profile HMM

As mentioned previously, a key feature of \(Clustal\Omega\) is that it can take an external profile hidden Markov model of a protein family and use it to align highly divergent members. As mentioned by Sievers & Higgins (2014), the main problem with progressive multiple sequence alignment is that alignments that were made during the early stages of the MSA cannot be changed at a later stage. This has been summarized by the phrase “once a gap, always a gap”.

Ideally one would like to have a degree of ‘foresight’ and know in advance which alignment columns are conserved and important in the alignment Sievers & Higgins (2014). A limited amount of foresight can be communicated to Clustal Omega if an alignment of sequences that are homologous and alignable to those in the input sequence set has already been constructed. Such alignment information can be conveniently described in the form of a profile hidden Markov model (profile HMM).

Use of profile HMMs is particularly natural within Clustal Omega, as its main alignment algorithm is based on alignment of profile HMMs (Söding, 2005). A profile HMM can be input into Clustal Omega as an ‘external profile’, which is used to guide the alignment of the unaligned sequences in the input file. The alignment of the unaligned residues will be guided by the distribution of residues and gaps in the HMM. Residues of the unaligned sequences will be aligned to the positions within the HMM that best fit the signature, and this is used to align those residues more accurately.

Experience has shown (Sievers et al., 2011) that external profile HMM information is rarely detrimental. Use of HMMs as external profiles is therefore very robust. However, its use is particularly helpful in the alignment of highly diverged sequences with low average pairwise sequence similarity. Such scenarios frequently lead to early misalignments and/or gaps, which can be partly avoided using background HMM information.

This procedure can be used if an existing multiple alignment of a set of sequences that has been carefully maintained is available. It can then be converted to a HMM and used to help align new sequences. Alternatively, HMMs can be downloaded from alignment databases such as PFAM. This procedure is also used to iterate the multiple alignment process, as described in the previous section.

5.8.1 Alignment of 146 human Ras GTPases using the PFAM00071 HMM

As a final exercise, we will align the classic set of 146 human Ras GTPases compiled by Wennerberg et al. (2005) with the aid of PFAM00071.hmm model for the Ras superfamily fetched from PFAM. Note that on the chaac server the alignment takes ~20 minutes to complete using 10 cores.

time clustalo -i Human_Ras_Superfam_proteins_Wennerberg_JCS05.faa --hmm-in Ras_fam_PF00071.hmm -o Human_Ras_Superfam_proteins_Wennerberg_JCS05_PF00071.aln --outfmt clu
real    28m5.611s <== almost half an hour!
user    2656m6.083s
sys 0m10.238s

Below is a screenshot of the central part of the resulting alignment, the region modeled by the PFAM00071 model, for a fraction of the 146 sequences, and viewed with seaview.

seaview Human_Ras_Superfam_proteins_Wennerberg_JCS05_PF00071.aln &
Ras superfamily clustal-omega alignment using PFAM00071
Ras superfamily clustal-omega alignment using PFAM00071

The alignment shown above was used to compute the maximum likelihood tree shown below for the family (Vinuesa, unpubl.)

Maximum likelihood phylogeny of 146 human Ras superfamily members. Vinuesa unpubl.
Maximum likelihood phylogeny of 146 human Ras superfamily members. Vinuesa unpubl.

5.8.2 Motif and logo analysis of highly conserved alignment regions of the Ras superfamily

Seaview can be used to easily extract certain alignment regions or columns. These can then be used to compute the sequence motif and logos that summarize the residue frequency distribution using ad hoc Perl and R scripts written by the author.

Below the logo and motif of the first highly conserved (G1) region of Ras GTPases

Alignment and sequence logo of the highly conserved G1 region of human Ras GTPases
Alignment and sequence logo of the highly conserved G1 region of human Ras GTPases

A simpler, less informative way to represent these highly conserved signatures is by a sequence motif, as shown below

# sequence motif for the highly conserved G1 region of Ras GTPases with max. degeneracy set to 4
xxxxxxxx[GAE]xxxx[GAV][KT][TSGK]xxxxxxxxx